[SPARK

Thank you so much, @jamartinh, @srowen, @HyukjinKwon, and @gatorsmile.

There are two separate existing problems here.

First, a) Spark returns an incorrect result for an existing Hive table that already has the skip.header.line.count table property. This is the most common use case, and it is the one this issue aims to solve.

Second, and more absurdly, b) Spark can create a table with the skip.header.line.count table property, yet only Hive returns the correct result from that table.

Spark (current master branch)

    scala> sql("CREATE TABLE t2 (id INT, value VARCHAR(10)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' TBLPROPERTIES('skip.header.line.count'='1')")
    scala> sql("LOAD DATA LOCAL INPATH '/data/test.csv' OVERWRITE INTO TABLE t2")
    scala> sql("SELECT * FROM t2").show
    +----+-----+
    |  id|value|
    +----+-----+
    |null|   c2|
    |   1|    a|
    |   2|    b|
    +----+-----+

Hive

    hive> select * from t2;
    OK
    1	a
    2	b

@gatorsmile, I totally agree with the Apache Spark development direction. However, IMO, the choice between TBLPROPERTIES and OPTIONS is not the issue this PR should address, because this PR only updates TableReader.scala to support the existing table property, i.e. case a). I used TBLPROPERTIES in the example simply because Spark already supports it. I can update the PR description to focus on a) instead of b).
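To make case a) concrete, here is a minimal sketch of what honoring skip.header.line.count means for a reader. It is not the code in TableReader.scala: the object name `SkipHeaderSketch` and both helpers are hypothetical, table properties are modeled as a plain `Map`, and the sample file contents are an assumption inferred from the Spark output above.

```scala
// Hypothetical illustration only; not the implementation in TableReader.scala.
object SkipHeaderSketch {
  // Property name as it appears in the table definition above.
  val SkipHeaderKey = "skip.header.line.count"

  /** Number of leading lines to skip; 0 if the property is absent,
    * not an integer, or negative. */
  def headerLineCount(tblProperties: Map[String, String]): Int =
    tblProperties.get(SkipHeaderKey)
      .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
      .filter(_ >= 0)
      .getOrElse(0)

  /** Drops the header lines from the lines of a single file. */
  def skipHeader(lines: Iterator[String],
                 tblProperties: Map[String, String]): Iterator[String] =
    lines.drop(headerLineCount(tblProperties))

  def main(args: Array[String]): Unit = {
    val props = Map(SkipHeaderKey -> "1")
    // Assumed contents of /data/test.csv (the real file is not shown here).
    val csv = Iterator("c1,c2", "1,a", "2,b")
    skipHeader(csv, props).foreach(println)
  }
}
```

With the assumed input, only the two data rows remain, which matches what Hive returns. The sketch treats the whole file as one iterator; a distributed reader would presumably need to apply the skip only to the split that starts at the beginning of each file.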

Someday, Apache Spark may remove (or block) the TBLPROPERTIES SQL syntax in favor of the OPTIONS syntax. That's okay; it would simply be an intentional regression. However, even in that case, Spark should still read a Hive table with skip.header.line.count correctly.
